Overview

[sniper.gif] [pubg_map.jpg]
Left: A skilled sniper taking out a moving target. Right: A PUBG map.

Motivation

Battle royale games have surged in popularity in recent years. The premise of such games is as follows: players are dropped onto a fictional island and fight to be the last person standing. As they roam around the island, they loot for weapons and items crucial for their survival. Players can choose to join a game as a solo player or with a group of friends (4 players maximum). When playing solo, players are immediately eliminated when they are killed. Players not only have to worry about getting killed by other players, but they also have to stay within the shrinking “safe zone,” which effectively forces players into contact with each other. Outside of the “safe zone,” players take damage to their health at increasing rates.

We are interested in building a prediction model for the popular battle royale game PUBG (PlayerUnknown’s Battlegrounds). Given a set of player and match characteristics, how well can we predict each player’s end placement? Can we predict who is going to win the game?

Through our analysis, we aim to understand what characterizes the top finishers: How aggressive are their playing styles? Do players who travel farther on the map tend to place higher or lower? Answers to such questions will be of high interest for the PUBG gaming community.

Initial Questions

The main goal of this project is to predict a solo player’s finish placement based on their in-game actions. Specifically, the three subquestions of interest we began with are:

  1. Clustering: Which playing styles are most successful?

  2. Model Prediction: How well can we predict a player’s finish placement? How well can we classify winners?

  3. Feature Ranking: What player actions or statistics are most predictive of their finish placement?

For question #1, we unfortunately found that our data didn’t lend itself well to clustering playing styles. For example, data on where a player chooses to land on the map would indicate how aggressive a player is at the beginning of the match – landing in a city with plenty of looting items, where you are likely to encounter many other players, versus landing in a remote area, where you likely will not. Data on when a player kills another player would also have been insightful – a consistently aggressive winner would likely have kills throughout the game, while a more passive winner would only kill towards the end of the game, when it becomes necessary to do so.

Due to the limitations of the data with regard to clustering, we instead did more exploratory analysis with dimension reduction (principal components analysis) to help us better understand the features and improve our answers to questions #2 and #3, which became the focus of our analysis.

Data

The data comes from the Kaggle PUBG Finish Placement Prediction competition.

  • For TA use: Run the code chunk below to download the data from the Dropbox link. This link is guaranteed to be available only during the grading period.
  • For everyone else: Join the Kaggle competition and run the shell script download_data.sh.
data.url <- "https://www.dropbox.com/s/mp89gp57cz2dsc7/train_V2.csv.zip?dl=1"

if(!file.exists("./data/train_V2.csv.zip")){
  download.file(data.url, destfile = "./data/train_V2.csv.zip", mode = "wb")
}
# Warning: Large dataset (628 MB), will take a minute or so to read.
raw_dat <- read_csv("./data/train_V2.csv.zip")

clean_dat = raw_dat %>%
  clean_names() %>%
  drop_na(win_place_perc) # Drop rows without outcome variable

Variables

Each row in the data contains one player’s post-game stats. A description of all data fields is provided in data/pubg_codebook.csv. We will focus on the solo game mode (match_type = solo, solo-fpp, or normal-solo-fpp). The solo game mode constitutes about 16% of the data, with 720,386 observations. The outcome variable we are trying to predict is win_place_perc.

solo_dat <- clean_dat %>% 
  filter(match_type %in% c("solo", "solo-fpp", "normal-solo-fpp")) %>%
  select(-dbn_os, -assists, -revives, -group_id, -match_type, -team_kills) %>%        # Remove features that are not relevant to solo mode
  mutate(kill_points = ifelse(rank_points == -1 & kill_points == 0, NA, kill_points), # Following codebook explanations
         win_points = ifelse(rank_points == -1 & win_points == 0, NA, win_points), 
         rank_points = ifelse(rank_points == -1, NA, rank_points),
         id = as.factor(id), 
         match_id = as.factor(match_id)) %>%
  select(-kill_points, -win_points, -rank_points) # Variables being deprecated

The variable num_groups did not match the actual number of groups in our data, so we replace it with the number of players observed in each match, which we call n_groups.

match_n_groups = solo_dat %>%
  group_by(match_id, num_groups) %>%
  count() %>%
  ungroup()

head(match_n_groups)
# A tibble: 6 x 3
  match_id       num_groups     n
  <fct>               <int> <int>
1 0002912fe5ed71         92    95
2 00086e740a5804         95    98
3 001616ed5da99b         94    97
4 001937f739426c         92    95
5 001cd8e7e6b737         17    24
6 001eeedf57047a         97    99
match_n_groups = match_n_groups %>%
  select(-num_groups)

solo_dat = solo_dat %>%
  full_join(match_n_groups, by = "match_id") %>%
  select(-num_groups) %>%
  rename(n_groups = n)

Additionally, we remove any matches with more observations than the maximum number of players, since that should not be possible.

# Compute proportions
prop_data = solo_dat %>% 
  group_by(match_id, max_place, match_duration) %>% 
  count() %>%
  ungroup() %>%
  mutate(prop = n/max_place,
         remove_game = prop > 1) 

print("Proportion of players we have data for (out of max_place)")
[1] "Proportion of players we have data for (out of max_place)"
summary(prop_data$prop) 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.9091  1.0000  1.0000  1.0010  1.0000  1.7059 
# Remove games with proportion greater than 100%
remove_match_ids = prop_data %>%
  filter(remove_game) %>%
  pull(match_id)

solo_dat = solo_dat %>%
  filter(!(match_id %in% remove_match_ids))

There are 37 matches for which this was the case.

Training and Test Set

We are given a training set and a test set. The outcome variable for the test set will not be provided until the end of the Kaggle competition on Jan. 30th, 2019. Therefore, for this project, we will only be using the provided training set. Due to computational costs in fitting models, we will only use observations from 2000 of the matches (23% of the available data). Within this set, we create our own “training” (60%), “validation” (20%), and “test” set (20%). We split observations by match rather than player, so all players from the same match will be in the same set (“training”, “validation”, or “test”).

For the rest of our analysis, the “training set” refers to the one we’ve created.

# Sample 2000 matches
set.seed(1)
matches = sample(unique(solo_dat$match_id), 2000, replace = FALSE) # Sample from unique match IDs
solo_dat = solo_dat %>% filter(match_id %in% matches)

# Split into train and test
train_ind = sample(c(T,F), prob = c(0.6, 0.4), size = length(unique(solo_dat$match_id)), replace = TRUE)
train_solo = solo_dat %>%
  filter(match_id %in% unique(solo_dat$match_id)[train_ind])
temp_solo = solo_dat %>%
  filter(!(match_id %in% unique(solo_dat$match_id)[train_ind]))

# Split into validation and test
test_ind = sample(c(T,F), prob = c(0.5, 0.5), size = length(unique(temp_solo$match_id)), replace = TRUE)
val_solo = temp_solo %>%
  filter(match_id %in% unique(temp_solo$match_id)[test_ind])
test_solo = temp_solo %>%
  filter(!(match_id %in% unique(temp_solo$match_id)[test_ind]))
glimpse(train_solo)
Observations: 99,040
Variables: 20
$ id               <fct> 73348483a5974b, 677089d4226510, cfa11f07d8069...
$ match_id         <fct> 85601fe44d519b, 06f853dab32a20, 4563c0409f911...
$ boosts           <int> 0, 0, 3, 0, 1, 4, 6, 5, 0, 2, 0, 0, 1, 3, 0, ...
$ damage_dealt     <dbl> 17.81, 46.41, 76.44, 19.35, 161.80, 216.20, 3...
$ headshot_kills   <int> 0, 0, 0, 0, 0, 0, 1, 3, 0, 0, 0, 0, 0, 0, 0, ...
$ heals            <int> 0, 0, 2, 0, 0, 1, 2, 2, 0, 1, 0, 0, 6, 1, 0, ...
$ kill_place       <int> 79, 81, 50, 96, 15, 12, 5, 2, 30, 19, 37, 86,...
$ kills            <int> 0, 0, 0, 0, 2, 2, 4, 8, 1, 2, 1, 0, 2, 1, 0, ...
$ kill_streaks     <int> 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, ...
$ longest_kill     <dbl> 0.000, 0.000, 0.000, 0.000, 52.750, 79.950, 6...
$ match_duration   <int> 1945, 1493, 1441, 1923, 1349, 1435, 1882, 138...
$ max_place        <int> 99, 95, 95, 96, 94, 95, 97, 98, 91, 96, 89, 9...
$ ride_distance    <dbl> 129.3, 0.0, 1054.0, 0.0, 0.0, 299.6, 0.0, 0.0...
$ road_kills       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ swim_distance    <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.0...
$ vehicle_destroys <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ walk_distance    <dbl> 471.90, 50.82, 1542.00, 20.34, 536.90, 2363.0...
$ weapons_acquired <int> 3, 1, 2, 1, 6, 4, 3, 5, 4, 3, 1, 2, 5, 3, 2, ...
$ win_place_perc   <dbl> 0.2245, 0.1489, 0.8830, 0.0000, 0.6667, 0.925...
$ n_groups         <int> 99, 95, 95, 96, 94, 95, 97, 98, 91, 96, 89, 9...

In the training set, we have 99,040 players and 1,058 matches.

Exploratory Data Analysis

Distribution of Features by Finish Percentile

We first explored the distribution of each feature by the final finish percentile. Players were grouped into the 0-19th, 20-39th, 40-59th, 60-79th, or 80-100th percentile, and we plotted the density of each feature by percentile group. This gives some indication of how predictive our features will be. Note that due to extreme outliers, we excluded the highest 1% of many of the features for clearer visualizations.

filter_vars = c("boosts", "damage_dealt", "headshot_kills", "heals", "kills", "longest_kill", "ride_distance", "swim_distance", "walk_distance", "weapons_acquired")
train_solo %>% 
  filter_at(vars(filter_vars), all_vars(. < quantile(., 0.99, na.rm = T))) %>%   # Remove outliers
  rename_at(vars(filter_vars), ~ paste0(., "*")) %>% # Mark outlier-trimmed variables with an asterisk
  mutate(win_place_cat = floor(win_place_perc / 0.2),
         win_place_cat = ifelse(win_place_cat == 5, 4, win_place_cat),
         win_place_cat = as.factor(win_place_cat)) %>%
  gather("feature", "value", -match_id, 
         -id, -win_place_perc, -win_place_cat) %>%
  ggplot(aes(x = value, group = win_place_cat, color = win_place_cat)) +
  facet_wrap(feature ~., scales = "free") +
  geom_density() +
  labs(title = "Distribution of Features by Finish Percentile", 
       caption = "* Removed outliers (> 99th percentile) from this feature's density plot",
       x = "Value of Features", y = "Density", color = "Percentile") +
  scale_color_manual(labels = c("0-19", "20-39", "40-59", "60-79", "80-100"),
                     values = brewer.pal(5, "OrRd")) +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Some interesting relationships between the features and the finish percentile:

  • Use of Items (boosts, heals, and weapons_acquired): Players who finish higher tend to have used more boosts and healing items, and acquired more weapons. This is expected since they stayed in the game for a longer period and have more time to collect and use items. However, it would be interesting to explore which of these features is most predictive of a high finish placement.

  • Kills & Damage (damage_dealt, kill_place, and kills): Players who finish higher tend to have more kills. They also tend to have dealt more damage. However, in the top finishing group, there is a wide variety in how much damage they inflict. This could potentially indicate strategies that differ in their level of aggressiveness during the course of the game but are similarly successful in achieving a high placement.

  • Distance Traveled (walk_distance, swim_distance, and ride_distance): Players who finish higher tend to have walked farther. This is likely because they simply survive longer and are forced to travel to stay in the safe zone, whereas players who die early don’t get a chance to travel very far. Both swimming and riding in vehicles are rare occurrences, though it appears that players who finish higher also tend to do more of both.

To summarize, many features are correlated with the finish percentile. However, we also want to take a look at the correlation between features, as that could present a problem when fitting models and determining feature rankings. We can look into this with a correlation plot.

Correlation Plot

corr_matrix = train_solo %>% 
  select(-id, -match_id) %>% 
  cor()

corrplot(corr_matrix, method = "color", type = "full", title = "Correlation Plot of Features",
         mar = c(0, 0, 1, 0)) # mar keeps the title from being clipped

Clearly, many features are highly correlated with each other. In particular, kills, kill_place, kill_streaks, and longest_kill are, as expected, highly correlated. Sparser features such as road_kills and swim_distance exhibit little correlation with other features. Looking at each feature’s correlation with our outcome variable win_place_perc, we see that boosts, kill_place, and walk_distance are among the most predictive features (without adjusting for other features).

Because of the high correlation between features, we determined that it was important to account for it in our prediction methods. One such method is elastic net regression, which incorporates L2 regularization to mitigate the problems caused by highly correlated features. Elastic net regression also uses L1 regularization, which shrinks some coefficients exactly to 0 and thereby performs feature selection, which is of interest to us for feature ranking.

We also note that in the training set, n_groups = max_place for all matches, which means that we have data for all players up to the max_place player. Therefore, we remove the n_groups feature.

all.equal(train_solo$n_groups, train_solo$max_place)
[1] TRUE
train_solo = train_solo %>%
  select(-n_groups)

Dimension Reduction and Clustering

Another way to investigate feature ranking is to hypothesize that the features exhibiting the most variability also best explain our outcome variable. Principal components analysis (PCA) is a popular dimension reduction method for finding these directions of maximal variability.
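PCA makes this hypothesis precise: the first principal component loading vector \(w_1\) is the unit-length direction along which the (centered and scaled) feature matrix \(X\) varies the most,

\[
w_1 = \underset{\lVert w \rVert_2 = 1}{\arg\max}\ \operatorname{Var}(X w),
\]

and each subsequent component solves the same problem subject to being orthogonal to all earlier loading vectors. If the directions of largest variance also carry the most signal about win_place_perc, their loadings can serve as a rough feature ranking.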

Principal Components Analysis

set.seed(1)
# PCA
X = train_solo %>%
  select(-win_place_perc, -id, -match_id) %>%
  as.matrix()

pc = prcomp(X, center = T, scale = T)
pc$rotation[, 1:3] %>% kable()
                        PC1        PC2        PC3
boosts            0.3147751 -0.1622042  0.1686956
damage_dealt      0.3672756  0.2164339 -0.1067439
headshot_kills    0.2867165  0.2434542 -0.1197904
heals             0.2100020 -0.2544299  0.2051843
kill_place       -0.3696846 -0.0397688 -0.0564925
kills             0.3726581  0.2387193 -0.1104184
kill_streaks      0.3204852  0.2265997 -0.0981452
longest_kill      0.3042571  0.1016077 -0.1001660
match_duration    0.0336809 -0.4802400 -0.3069652
max_place        -0.0110764  0.0205827 -0.0705728
ride_distance     0.1315557 -0.4846559 -0.2985481
road_kills        0.0455555 -0.1201550 -0.5130499
swim_distance     0.0560272 -0.1532687  0.4969032
vehicle_destroys  0.0411344 -0.1223628 -0.2759644
walk_distance     0.2924897 -0.2906486  0.2586336
weapons_acquired  0.2405694 -0.2815642  0.1647059
# Plot eigenvalues (percentage of variances explained by each principal component)
fviz_eig(pc)

# Plot cumulative variance explained by the first k PCs
data.frame(cumvar = cumsum(pc$sdev^2/sum(pc$sdev^2))) %>%
  ggplot(aes(x = 1:length(cumvar), y = cumvar*100)) +
  geom_point() +
  geom_line() + 
  labs(title = "Cumulative Variance Explained by the First K PCs", 
       x = "K", y = "Percent of Variance Explained") +
  scale_y_continuous(limits = c(0, 100)) +
  theme_bw()

# Plot features in terms of the first two principal components
# Positively correlated features point in the same direction, negatively correlated features point to opposite direction
fviz_pca_var(pc,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE         # Avoid text overlapping
             )

The first principal component explains about a third of the variability in our features (35.6%), while the second through sixth principal components each explain between 5 and 10%. The first three components together cover over 50% of the variation in our data.

We also visualize each feature’s contribution to the first two principal components and their correlations.

  1. PC 1: The first principal component is composed of features related to a player’s damage and kills, primarily kills, kill_place, and damage_dealt. To a lesser extent, kill_streaks, longest_kill, headshot_kills, boosts, and heals also contribute mainly to the first principal component.

  2. PC 2: The second principal component is characterized more by match-setting details such as match_duration, though ride_distance and walk_distance also contribute to this component.

  3. PC 3: We can examine the table of contributions to each principal component to deduce which features contribute the most to the third principal component. The third principal component explains variation in features not well-represented in the first two components, such as swim_distance and road_kills.

From PCA, we can conclude that there are two primary dimensions along which our observations vary in terms of their features. The first dimension relates to player actions, specifically, how much damage you dealt to other players and how many kills you have. The second dimension relates to match characteristics. From a player perspective, we are much more interested in the first principal component, since how you play is something within your control, while match characteristics are not.

# Plot points on principal components 1 and 2 
data.frame(PC1 = pc$x[,1], PC2 = pc$x[,2]) %>%
  ggplot(aes(PC1, PC2, color = train_solo$win_place_perc)) +
  geom_point(alpha = 0.5) +
  labs(title = "Win Place Percent of Training Data by PC1 and PC2",
       color = "Win Place Percent") +
  theme_bw()

# Plot points on principal components 1 and 3
data.frame(PC1 = pc$x[,1], PC3 = pc$x[,3]) %>%
  ggplot(aes(PC1, PC3, color = train_solo$win_place_perc)) +
  geom_point(alpha = 0.5) + # alpha set as a constant, not mapped inside aes()
  labs(title = "Win Place Percent of Training Data by PC1 and PC3",
       color = "Win Place Percent") +
  theme_bw()

# Plot points on principal components 2 and 3
data.frame(PC2 = pc$x[,2], PC3 = pc$x[,3]) %>%
  ggplot(aes(PC2, PC3, color = train_solo$win_place_perc)) +
  geom_point(alpha = 0.5) +
  labs(title = "Win Place Percent of Training Data by PC2 and PC3",
       color = "Win Place Percent") +
  theme_bw()

We want to investigate whether the top principal components are predictive of the finish percentile. In the plots above, we plot observations on the top 3 principal components and color the points by finish percentile. Principal component 1 exhibits a strong correlation with the finish percentile, with higher values corresponding to a higher win place percentage. This is a promising result, since principal component 1 consists of player actions, which is what we are most interested in. In other words, as hypothesized, the largest variability we see in our features also appears to be predictive of our outcome variable.

Clustering of Winners

In this section, we attempt to answer our first question:

  1. Clustering: Which playing styles are most successful?

Since the probability of winning increases as the number of remaining players decreases, it is to a player’s advantage to eliminate others. We can thus hypothesize that successful players may take one of two approaches to the game:

  1. Aggressive Strategy: Players drop in a populated location and attempt to take control of the location. This is a high-risk and high-reward strategy as players are more likely to die early in the game due to an increased number of encounters but are rewarded with superior weapons and boosts.

  2. Passive Strategy: Rather than eliminating other players yourself, the passive strategy relies on surviving until other players have been eliminated. This typically involves hiding to minimize encounters until only a few players remain, then selectively taking fights until the player is the only one remaining.

First, we attempted to identify these strategies in our data by looking at the distribution of the percentage of total players killed by winners.

train_solo %>% filter(win_place_perc == 1) %>%
  mutate(prop_killed = kills/max_place) %>%
  ggplot(aes(x = prop_killed)) +
  geom_histogram(aes(y = ..density..), color = 'white', binwidth = 0.01) +
  geom_density(color = 'blue') +
  labs(title = "Distribution in Proportion of Players Killed for Match Winners", 
       x = "Proportion of Players Killed", y = "Density") +
  theme_bw()

Most winners kill a substantial proportion of their opponents – on average, they eliminate 6.5% of the players in the match. However, if there were a clear distinction between the two strategies in terms of the proportion of players killed, we would expect a multimodal distribution among match winners. There is a slight bump for winners who killed between 0 and 0.1 of the players (where the winner may have eliminated only the second-place player), but it does not appear pronounced enough to characterize the distribution as bimodal.

# PCA for winners
winners_solo = train_solo %>%
  filter(win_place_perc == 1)

X = winners_solo %>%
  select(-win_place_perc, -id, -match_id) %>%
  as.matrix()

pc = prcomp(X, center = T, scale = T)

# Plot feature contribution to PCs
fviz_pca_var(pc,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE         # Avoid text overlapping
             )

# Plot winners by kills
data.frame(PC1 = pc$x[,1], PC2 = pc$x[,2]) %>%
  ggplot(aes(PC1, PC2, color = (winners_solo$kills==0))) +
  geom_point(alpha = 0.5) +
  labs(title = "Winners by PC1 and PC2, colored by number of kills",
       color = "Number of kills = 0") +
  theme_bw()

# Plot winners by number of players in the match
data.frame(PC1 = pc$x[,1], PC2 = pc$x[,2]) %>%
  ggplot(aes(PC1, PC2, color = (winners_solo$max_place))) +
  geom_point(alpha = 0.5) +
  labs(title = "Winners by PC1 and PC2, colored by number of players",
       color = "Number of players in the match") +
  theme_bw()

# Plot winners by match duration
data.frame(PC1 = pc$x[,1], PC2 = pc$x[,2]) %>%
  ggplot(aes(PC1, PC2, color = (winners_solo$match_duration))) +
  geom_point(alpha = 0.5) +
  labs(title = "Winners by PC1 and PC2, colored by match duration",
       color = "Match duration") +
  theme_bw()

We use PCA on the winners to see if there are clear clusters, and in fact we see some evidence of them. There is a smaller cluster to the left, representing winners of matches with very few players who did not record any kills. The larger cluster also shows some separation into two sub-clusters defined by match duration. In other words, the clusters likely reflect match characteristics rather than different playing styles.

Upon further consideration, even if the playing styles we described accurately reflect reality, we may simply lack the data needed to cluster them. Data we would have liked to have includes:

  1. Drop location: An important indicator of aggressiveness, as described earlier, is where a player chooses to drop on the map. We did not have this data and are therefore missing an important signal for playing style.

  2. Time of kill: Aggressive players would likely be killing throughout the whole game while passive players would have most of their kills concentrated in the end. Or there may be a hybrid strategy where players are aggressive in the beginning to loot more items, then keep a low profile until the very end. Thus, longitudinal data for when a player killed other players would have helped separate different strategies.

Since we only have data on the total number of kills (relative and absolute), that may not be a sufficient signal to separate playing strategies. On the other hand, it is also possible that clusters of playing strategies don’t exist at all. Perhaps all winners pursue what we would consider an aggressive strategy, in which case we would not see separate clusters either.

Prediction Models & Feature Ranking

In this section, we aim to answer our remaining two questions:

  1. Model Prediction: How well can we predict a player’s finish placement? How well can we classify winners?
  2. Feature Ranking: What player actions or statistics are most predictive of their finish placement?

To answer these questions, we will fit the following models, test our model on an independent set, and examine feature importance scores:

  1. Linear regression
  2. Elastic net regression
  3. Random forest

Note: In training our models, we do not include match ID or player ID as features, since neither is going to generalize for predicting new games.

Linear Regression

First, we fit a linear regression model with 5-fold cross-validation. Since the win place percentage is a value between 0 and 1, we apply a log transformation (adding 1 to ensure all values are defined), so that it better fulfills the assumption that \(y\) is a continuous variable on the real line. However, this still doesn’t guarantee that the predicted values will be between 0 and 1, so we constrain the predicted value to be 0 if it is negative and 1 if it is above 1.
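The transform and inverse-transform with clamping can be sketched in base R (the helper names to_target and from_target are our own, not part of the modeling pipeline below):

```r
# Forward transform: map the outcome y in [0, 1] to log(y + 1)
to_target <- function(y) log(y + 1)

# Inverse transform for predictions, clamped back into [0, 1]
from_target <- function(pred) pmin(pmax(exp(pred) - 1, 0), 1)

y <- c(0, 0.25, 1)
stopifnot(all.equal(from_target(to_target(y)), y)) # round-trips exactly
from_target(c(-0.1, 0.8)) # out-of-range predictions are clamped to 0 and 1
```

The same clamping is applied after back-transforming the model predictions via case_when in the code below.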

set.seed(1)
tc = trainControl(method = "cv", number = 5)

# Fit linear model
lm_model = train(log(win_place_perc + 1) ~ . - match_id - id, data = train_solo, 
                 method = "lm", trControl = tc)
print(lm_model)
Linear Regression 

99040 samples
   18 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 79232, 79232, 79232, 79231, 79233 
Resampling results:

  RMSE       Rsquared   MAE      
  0.0685495  0.8872068  0.0495968

Tuning parameter 'intercept' was held constant at a value of TRUE
# Save predictions
predictions_solo = data.frame(win_place_perc = val_solo$win_place_perc,
                              lm_pred = exp(predict(lm_model, val_solo)) - 1) %>%
  mutate(lm_pred = case_when(lm_pred < 0 ~ 0,
                             lm_pred > 1 ~ 1,
                             lm_pred >= 0 & lm_pred <= 1 ~ lm_pred)) # Constrain values to [0,1]

# Plot predicted vs true values
predictions_solo %>%
  ggplot(aes(x = win_place_perc, y = lm_pred)) +
  geom_point(alpha = 0.5, size = 0.1) +
  labs(title = "Predicted Versus Actual Win Place Percentage (Linear Regression)", 
       x = "Actual Win Place Percentage", y = "Predicted Win Place Percentage") +
  theme_bw()

Elastic Net Regression

As we noted earlier in the data exploration, some of the features are highly correlated and/or seem to provide little signal (such as distance traveled). To address these issues, we implement elastic net regression, which uses a ridge-like penalty to adjust for correlated features and a lasso penalty to shrink non-informative coefficients to zero. Using 5-fold cross-validation on the training set, we tune the regularization hyperparameters.
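Concretely, for a mixing parameter \(\alpha\) and penalty strength \(\lambda\), glmnet minimizes the penalized least-squares objective

\[
\min_{\beta_0, \beta}\ \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top} \beta \right)^2 + \lambda \left[ \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right],
\]

where \(\alpha = 0\) gives ridge regression, \(\alpha = 1\) gives the lasso, and intermediate values mix the two penalties; \(\alpha\) and \(\lambda\) are the hyperparameters tuned by cross-validation.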

set.seed(1)
# Fit lasso_model using glmnet
lasso_model = train(log(win_place_perc + 1) ~ . - match_id - id, data = train_solo, 
                    method = "glmnet", trControl = tc)
print(lasso_model)
glmnet 

99040 samples
   18 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 79232, 79232, 79232, 79231, 79233 
Resampling results across tuning parameters:

  alpha  lambda        RMSE        Rsquared   MAE       
  0.10   0.0003230778  0.06856201  0.8871773  0.04988400
  0.10   0.0032307776  0.06913346  0.8856577  0.05166476
  0.10   0.0323077761  0.08220437  0.8464587  0.06555230
  0.55   0.0003230778  0.06856463  0.8871703  0.04985993
  0.55   0.0032307776  0.06962549  0.8844075  0.05234333
  0.55   0.0323077761  0.09562115  0.7970951  0.07656673
  1.00   0.0003230778  0.06858067  0.8871194  0.04986878
  1.00   0.0032307776  0.07029232  0.8826508  0.05319175
  1.00   0.0323077761  0.09996480  0.7933901  0.08073803

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0.1 and lambda
 = 0.0003230778.
#coef(lasso_model$finalModel, lasso_model$bestTune$lambda)

# Add predictions to df
predictions_solo = predictions_solo %>%
  mutate(lasso_pred = exp(predict(lasso_model, val_solo)) - 1,
         lasso_pred = case_when(lasso_pred < 0 ~ 0,
                                lasso_pred > 1 ~ 1,
                                lasso_pred >=0 & lasso_pred <= 1 ~ lasso_pred)) # Constrain values to [0,1]

# Plot predicted vs true values
predictions_solo %>%
  ggplot(aes(x = win_place_perc, y = lasso_pred)) +
  geom_point(alpha = 0.5, size = 0.1) +
  labs(title = "Predicted Versus Actual Win Place Percentage (Elastic Net Regression)", 
       x = "Actual Win Place Percentage", y = "Predicted Win Place Percentage") +
  theme_bw()

Random Forest

To account for interactions between features, we can use a random forest model. Ensemble methods like random forest often outperform plain regression models on tabular data with nonlinear effects and interactions. Due to computational costs, however, we make the following choices:

  1. Train the random forest model on ~10,000 observations (as opposed to ~60,000 observations).
  2. Use 100 trees in the random forest.

Again using 5-fold cross-validation, we tune the random forest hyperparameters.

set.seed(1)
# Warning: Takes a while to run (~3 minutes)
imp_features = varImp(lasso_model$finalModel) %>%  
  mutate(variable = rownames(.)) %>%
  arrange(desc(Overall)) %>%
  #slice(1:10) %>%
  pull(variable)

rf_model = train(as.formula(paste("win_place_perc ~ ", paste(imp_features, collapse = "+"))), 
                 data = sample_n(train_solo, 10000), method = "rf", ntree = 100, importance = T, trControl = tc)
print(rf_model)
Random Forest 

10000 samples
   16 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 8000, 8000, 8001, 8000, 7999 
Resampling results across tuning parameters:

  mtry  RMSE        Rsquared   MAE       
   2    0.08672402  0.9223775  0.06691531
   9    0.06857612  0.9457451  0.04739896
  16    0.07026248  0.9430108  0.04814976

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 9.
# Add predictions to df
predictions_solo = predictions_solo %>%
  mutate(rf_pred = predict(rf_model, val_solo))

# Plot predicted vs true values
predictions_solo %>%
  ggplot(aes(x = win_place_perc, y = rf_pred)) +
  geom_point(alpha = 0.5, size = 0.1) +
  labs(title = "Predicted Versus Actual Win Place Percentage (Random Forest)", 
       x = "Actual Win Place Percentage", y = "Predicted Win Place Percentage") +
  theme_bw()

Note that the spread of points narrows as the actual win place percentage approaches 0, indicating that we can predict the finish percentile more accurately for players who place lower.
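One way to quantify this narrowing is to compute the RMSE within bins of the actual placement. A minimal base R sketch on synthetic data (binned_rmse is our own helper; on the real data you would pass the actual and predicted columns of predictions_solo):

```r
# RMSE of predictions within bins of the actual outcome
binned_rmse <- function(actual, predicted, breaks = seq(0, 1, by = 0.2)) {
  bins <- cut(actual, breaks = breaks, include.lowest = TRUE)
  tapply(seq_along(actual), bins, function(idx) {
    sqrt(mean((actual[idx] - predicted[idx])^2))
  })
}

# Synthetic example whose noise grows with the outcome,
# mimicking the fan shape in the plot above
set.seed(1)
actual    <- runif(5000)
predicted <- actual + rnorm(5000, sd = 0.02 + 0.1 * actual)
round(binned_rmse(actual, predicted), 3) # RMSE increases from the lowest to the highest bin
```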

Feature Ranking

Which features are most predictive? With the generic varImp function, we can compare the relative importance of features. The meaning of varImp for each model is:

  1. Linear regression: absolute value of the t-statistic
  2. Elastic net: absolute value of the coefficients
  3. Random forest: permutation importance scores, which correspond to the increase in out-of-bag prediction error (MSE, for regression) when each feature is randomly permuted
# Table of variable importance scores
lm_imp = varImp(lm_model$finalModel) %>%        # absolute value of t-statistic
  mutate(variable = rownames(.)) %>%
  arrange(desc(Overall)) %>%
  rename(lm_imp = Overall)

lasso_imp = varImp(lasso_model$finalModel) %>%  # absolute value of coefficients
  mutate(variable = rownames(.)) %>%
  arrange(desc(Overall)) %>%
  rename(lasso_imp = Overall)

rf_imp = varImp(rf_model$finalModel) %>%        # importance scores
  mutate(variable = rownames(.)) %>%
  arrange(desc(Overall)) %>%
  rename(rf_imp = Overall)

imp = full_join(lm_imp, lasso_imp, by = "variable") %>%
  full_join(., rf_imp, by = "variable") %>%
  select(variable, everything()) %>%
  mutate(variable = factor(variable, levels = variable[order(lasso_imp, decreasing = F)])) %>%
  gather(model, importance, -variable) %>%
  mutate(model = factor(model),
         model = fct_recode(model, 
                             "Elastic Net" = "lasso_imp",
                             "Linear" = "lm_imp",
                             "Random Forest" = "rf_imp"))

# Make a plot for variable importance
imp %>%
  ggplot(aes(x = variable, y = importance, fill = model)) +
  geom_col(position = "dodge") + 
  facet_wrap(~ model, scales = "free_x") +
  coord_flip() +
  labs(title = "Feature Ranking",
       x = "Feature",
       y = "Importance",
       fill = "Model") +
  scale_fill_manual(labels = c("Elastic Net", "Linear", "Random Forest"),
                    values = c("#999999", "#E69F00", "#56B4E9")) +
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        legend.position = "none")

Unexpectedly, there is little agreement among the models about which features are important. One explanation is the presence of highly correlated features in our data. For example, kill_place is rated as highly important by linear regression and random forest (but not elastic net regression), while kill_streaks is rated as highly important by elastic net regression (but not random forest), even though the two features are known to be highly correlated with each other. While random forest performs best in prediction, the caveat with its permutation importance metric is that when multiple features are all predictive of the outcome but highly correlated with one another, the importance score of each individual feature is suppressed.
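As a quick sanity check on this explanation, we can inspect the pairwise correlations among the kill-related features directly. This is a sketch, assuming the train_solo data frame with kill_place, kill_streaks, and kills columns as used above:

```r
# Pairwise correlations among the kill-related features; strong correlations
# here would support the "suppressed importance" explanation above.
train_solo %>%
  select(kill_place, kill_streaks, kills) %>%
  cor() %>%
  round(2)
```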

Since elastic net regression accounts for correlation between features, we will primarily use its variable importance values to summarize our findings for ranking the importance of player characteristics and actions:

  1. The most important predictor is the number of kills a player makes. Elastic net regression chooses kill_streaks as the most predictive among the different features related to kills. What this suggests is that while some players may succeed with less confrontational playing strategies, you need to kill other players or you will be eliminated.

  2. Among items used, weapons_acquired and boosts are the strongest predictors of outcome. This is interesting: one might expect weapons to be the dominant predictor among the item-related features, but weapon acquisition likely plateaus once a player has strong weapons. Boosts, on the other hand, provide increased health regeneration over time and a small movement speed bonus once the number of boosts consumed passes a particular threshold. Players tend to save boosts as an extra advantage in the later stages of the game, when most players already have powerful weapons. Consequently, the number of boosts consumed is also a strong indicator of a successful player.

Comparison of Models

We will compare our models with the following metrics on the validation set:

  1. Mean absolute error (MAE): Represents the average absolute deviation. \[MAE = \frac{\sum_{i=1}^n |\hat{y}_i - y_i|}{n}\]

  2. Self-defined accuracy metric (SDAM(x)): This metric is a function of a cutoff value \(x\). If the predicted outcome is within \(x\%\) of the actual win place percentage, we classify it as a “correct” prediction. Otherwise, it is an incorrect prediction.
    \[SDAM(x) = \frac{\sum_{i=1}^n \mathbb{1}_{\{|\hat{y}_i - y_i| \le x\}}}{n}\]

  3. Classification of Winners: We can compute the ROC curve by turning our predictions into a classification problem. Given a predicted win place percentage, we classify a player as a winner if their predicted value exceeds a cutoff value \(x\). For each cutoff value, we can then compute the sensitivity (true positive rate: the proportion of actual winners we classify as such) and the specificity (true negative rate: the proportion of actual losers we classify as such).

predictions_solo = predictions_solo %>%
  mutate(lm_ae = abs(lm_pred - win_place_perc),
         lasso_ae = abs(lasso_pred - win_place_perc),
         rf_ae = abs(rf_pred - win_place_perc))

# Summarize metrics
metrics_solo = predictions_solo %>%
  mutate(lm_correct = lm_ae <= 0.05,
         lasso_correct = lasso_ae <= 0.05,
         rf_correct = rf_ae <= 0.05) %>%
  summarize(lm_mae = mean(lm_ae),
            lm_sdam = mean(lm_correct),
            lasso_mae = mean(lasso_ae),
            lasso_sdam = mean(lasso_correct),
            rf_mae = mean(rf_ae),
            rf_sdam = mean(rf_correct)) %>%
  gather() %>%
  separate(key, c("model", "metric"))

# Plot MAE
metrics_solo %>%
  filter(metric == "mae") %>%
  ggplot(aes(x = model, y = value, label = round(value, 3))) +
  geom_col() +
  geom_label() +
  labs(title = "Comparison of MAE", x = "Model", y = "MAE") +
  scale_x_discrete(labels = c("Elastic Net", "Linear", "Random Forest")) +
  theme_bw()

# Plot SDAM
cutoff = seq(0, 0.5, by = 0.01)
prop_lm = sapply(cutoff, function(x) mean(predictions_solo$lm_ae < x))
prop_lasso = sapply(cutoff, function(x) mean(predictions_solo$lasso_ae < x))
prop_rf = sapply(cutoff, function(x) mean(predictions_solo$rf_ae < x))

data.frame(cutoff = rep(cutoff, 3), 
           prop = c(prop_lm, prop_lasso, prop_rf),
           model = rep(c("Linear", "Elastic Net", "Random Forest"), each = length(cutoff))) %>%
  ggplot(aes(x = cutoff, y = prop, color = model)) + 
  geom_point() +
  geom_line() +
  geom_vline(xintercept = 0.05, linetype = "longdash", color = "slategray") +
  annotate(geom = "text", x = 0.05, label="SDAM(5)", y = 0.8, colour = "slategray", angle = 90, vjust = -1.1) +
  labs(title = "Comparison of SDAM",
       subtitle = "Proportion of Predictions Within x% of Actual Win Place Percentage",
       x = "|Predicted - Actual Win Place Percentage|",
       y = "SDAM",
       color = "Model") +
  theme_bw()

# Plot ROC
predictions_solo = predictions_solo %>%
  mutate(winner = ifelse(win_place_perc == 1, 1, 0))

cutoff = seq(0, 1, by = 0.01)
sens_lm = sapply(cutoff, function(x) sum(predictions_solo$lm_pred > x & predictions_solo$winner)/sum(predictions_solo$winner))
spec_lm = sapply(cutoff, function(x) sum(predictions_solo$lm_pred <= x & !predictions_solo$winner)/sum(!predictions_solo$winner))
sens_lasso = sapply(cutoff, function(x) sum(predictions_solo$lasso_pred > x & predictions_solo$winner)/sum(predictions_solo$winner))
spec_lasso = sapply(cutoff, function(x) sum(predictions_solo$lasso_pred <= x & !predictions_solo$winner)/sum(!predictions_solo$winner))
sens_rf = sapply(cutoff, function(x) sum(predictions_solo$rf_pred > x & predictions_solo$winner)/sum(predictions_solo$winner))
spec_rf = sapply(cutoff, function(x) sum(predictions_solo$rf_pred <= x & !predictions_solo$winner)/sum(!predictions_solo$winner))

data.frame(cutoff = rep(cutoff, 3),
               sens = c(sens_lm, sens_lasso, sens_rf),
               spec = c(spec_lm, spec_lasso, spec_rf),
               model = rep(c("Linear", "Elastic Net", "Random Forest"), each = length(cutoff))) %>%
  ggplot(aes(x = 1-spec, y = sens, color = model)) + 
  geom_point() + 
  geom_line() + 
  theme_bw()  +
  labs(title = "ROC Curve",
       x = "False Positive Rate (1 - Specificity)",
       y = "Sensitivity",
       color = "Model")

From the above plots, we can see that random forest performs best on all three metrics, despite the restrictions we placed on it to keep the run time reasonable. Its MAE is 0.048, meaning that on average the predicted win place percentage is 0.048 off from the true finish percentile. Its SDAM is consistently higher than that of linear regression or elastic net regression; for example, its SDAM(5) is 0.645, meaning that for 64.5% of observations in our validation set, the predicted value is within 5% of the true win place percentage. Lastly, the area under its ROC curve is the largest, which indicates that it performs best at classifying the winners (sensitivity = 0.94, specificity = 0.85).
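The area under the ROC curve is not computed explicitly in the chunks above; a minimal trapezoidal-rule sketch, assuming the sens_rf and spec_rf vectors from the ROC code:

```r
# Trapezoidal-rule AUC from the sensitivity/specificity vectors computed above.
# Points are ordered by false positive rate before integrating.
fpr = 1 - spec_rf
ord = order(fpr)
auc = sum(diff(fpr[ord]) * (head(sens_rf[ord], -1) + tail(sens_rf[ord], -1)) / 2)
auc
```

The same calculation applied to sens_lm/spec_lm and sens_lasso/spec_lasso allows a numeric comparison of the three curves.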

Performance on Test Set

As the random forest model achieved the highest accuracy on the validation set, we now evaluate it on an independent test set.

pred_test = data.frame(rf_test_pred = predict(rf_model, test_solo)) %>%
  mutate(win_place_perc = test_solo$win_place_perc,
         rf_test_ae = abs(rf_test_pred - win_place_perc),
         rf_test_correct = rf_test_ae <= 0.05,
         winner = ifelse(win_place_perc == 1, 1, 0))

# Plot SDAM
cutoff = seq(0, 0.5, by = 0.01)
prop_test = sapply(cutoff, function(x) mean(pred_test$rf_test_ae < x))
data.frame(cutoff = rep(cutoff, 2), 
           prop = c(prop_rf, prop_test),
           set = rep(c("Validation", "Test"), each = length(cutoff))) %>%
  ggplot(aes(x = cutoff, y = prop, color = set)) + 
  geom_point() +
  geom_line() +
  geom_vline(xintercept = 0.05, linetype = "longdash", color = "slategray") +
  annotate(geom = "text", x = 0.05, label="SDAM(5)", y = 0.8, colour = "slategray", angle = 90, vjust = -1.1) +
  labs(title = "Comparison of SDAM",
       subtitle = "Proportion of Predictions Within x% of Actual Win Place Percentage",
       x = "|Predicted - Actual Win Place Percentage|",
       y = "SDAM",
       color = "Dataset") +
  theme_bw()

# Plot ROC
cutoff = seq(0, 1, by = 0.01)
sens_test_rf = sapply(cutoff, function(x) sum(pred_test$rf_test_pred > x & pred_test$winner)/sum(pred_test$winner))
spec_test_rf = sapply(cutoff, function(x) sum(pred_test$rf_test_pred <= x & !pred_test$winner)/sum(!pred_test$winner))

data.frame(cutoff = rep(cutoff, 2),
           sens = c(sens_rf, sens_test_rf),
           spec = c(spec_rf, spec_test_rf),
           set = rep(c("Validation", "Test"), each = length(cutoff))) %>%
  ggplot(aes(x = 1-spec, y = sens, color = set)) + 
  geom_point() + 
  geom_line() + 
  labs(title = "ROC Curve",
       x = "False Positive Rate (1 - Specificity)",
       y = "Sensitivity",
       color = "Dataset") +
  theme_bw()

pred_test_metrics = pred_test %>%
  summarize(rf_test_mae = mean(rf_test_ae),
            rf_test_sdam = mean(rf_test_correct))

On the test set, random forest has an MAE of 0.047 and an SDAM(5) of 0.653, similar to what we saw on the validation set. From the above plots of SDAM and ROC, we also see that the random forest model performs about as well on the test set as on the validation set.

Narrative and Summary

As casual gamers and biostatistics students, we are interested in features and strategies that characterize the top PUBG players. To answer these questions, we looked at player data for solo players in PUBG. The data consisted of 16 features describing player kills, damage, items used and acquired, and match attributes.

Best performing model: random forest

To predict finish placement percentile, we used linear regression, elastic net regression, and random forest. Using mean absolute error (MAE), our self-defined accuracy metric (SDAM) quantifying the proportion of predictions that fall within x% of the true win place percentage, and the ROC curve for classification of winners, we found that the random forest model performed best on the validation set. On an independent test set, the predicted finish percentile was on average 0.047 away from the true finish percentile. We also achieved a sensitivity of 0.88 and a specificity of 0.86, meaning that we correctly classify 88% of winners and 86% of losers.

Most predictive player features

In our exploratory data analysis, we noted that features like kills, boosts, and walk_distance are highly correlated with finish placement percentile (win_place_perc). These results are largely in agreement with our feature ranking analysis, where we determined that among player actions, kills were the most predictive, followed by the acquisition and usage of weapons and boosts.

Limitations

Our data analysis has a number of limitations. First, for computational reasons, we only used around 20% of the available solo-player data; in the future, we could repeat the analysis with a larger data set. Second, we only examined solo-player data, so our conclusions do not generalize to group gameplay, which is complicated by aspects of teamwork such as revival and positioning. Finally, since the final win place percentile is a value between 0 and 1, we could have modeled the outcome with a Beta distribution, and we could also have explored other machine learning methods such as boosting.
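A minimal sketch of the Beta-distribution idea, assuming the betareg package is available and using the train_solo data frame from above; because Beta regression requires responses strictly inside (0, 1), the outcome is first squeezed with the usual (y(n - 1) + 0.5)/n transformation:

```r
library(betareg)  # assumed to be installed

# Map win_place_perc from [0, 1] into (0, 1) so the Beta likelihood is defined
n = nrow(train_solo)
train_beta = train_solo %>%
  mutate(win_squeezed = (win_place_perc * (n - 1) + 0.5) / n)

# Illustrative Beta regression on a few of the top-ranked features
beta_model = betareg(win_squeezed ~ kill_streaks + boosts + weapons_acquired,
                     data = train_beta)
summary(beta_model)
```

Predictions from such a model stay inside (0, 1) by construction, which may suit a percentile outcome better than unconstrained regression.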

In addition, we were unfortunately unable to identify clusters of distinct playing strategies among winners, due (at least in part) to the limitations of the data. However, our analysis does indicate that aggressive players who accumulate more kills than other players perform better on average, and that winners tend to have a high number of kills.